QTM 350 Parmigiana Group: Kristie Yip, June Ro, Ryan Lee, Mai Phuong Pham-Huynh, Kexin Guan
AWS Rekognition is a cloud-based SaaS (Software as a Service) computer vision platform used to analyze images and videos efficiently. The operation is simple and quick: users only need to upload an image or video, and the service can identify objects, people, text, scenes, and even inappropriate content, each with a corresponding confidence level. Users can either use a pre-trained algorithm or train one with custom data, so no machine learning expertise is required. Rekognition also includes a simple API that can quickly access and analyze inputs stored in Amazon S3 buckets. More importantly, the service continually learns from new sources of data to add new labels and features, which makes it more accurate and more widely applicable.
Vision impairment significantly changes the way one sees the world. Limited brightness and blurred sight can disrupt everyday life, putting quality of life at risk. According to the World Health Organization, at least 2.2 billion people around the globe have a near or distance vision impairment. Walking without sufficient sight poses a serious danger of falls and collisions, and although mobility aids such as a cane or a guide dog can mitigate the risks, people with vision impairment need special consideration (Manduchi et al., 2002).
Here, we strive to develop an application for people with vision impairment. The application will detect an object, identify it accurately, and tell the user what the object is and how far ahead it is. We aim to test how well AWS Rekognition can identify an object or person correctly even at night or in bad weather. If AWS Rekognition produces satisfactory results, we would be confident deploying it as the detection algorithm for our application.
We chose some common situations that are part of many people's daily routines. However, navigating these situations is challenging for people with vision impairment.
Each member of our group took two photos of each situation to capture the scenes. For example, a photo of a bus driving by is a scene we can easily encounter when using public transportation. We collected 20 photos in total, 10 for each situation.
Then, we edited the images by adjusting the levels of the following three metrics:
For each image, we produced 10 adjusted versions, with brightness levels ranging from 0.5 to 1.5 in increments of 0.1 (a level of 1.0 is the unedited original). We repeated the same process for contrast. As an exception, we adjusted sharpness from 0.0 to 2.0 in increments of 0.2, because increments of 0.1 did not yield a noticeable change in sharpness.
Through this transformation, we intended to produce different versions of the same images that best represent situations resembling impaired vision, such as nighttime or foggy days. Although adjusting these metrics does not perfectly reproduce those situations, it lets us compare how AWS Rekognition's responses change for the adjusted images.
Below is the code that we used to transform the images.
The following code iterates through all images in the source directory, named in the form imagename-orig.jpg, transforms each one with the transformData function, and saves the results into the destination directory.
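A minimal sketch of such a transformData helper, assuming Pillow's ImageEnhance module; the function name comes from the text, but its exact signature here is an assumption:

```python
from PIL import Image, ImageEnhance

# Pillow maps each of the three metrics onto an enhancer class; a level of
# 1.0 reproduces the original image for all three.
ENHANCERS = {
    "brightness": ImageEnhance.Brightness,
    "contrast": ImageEnhance.Contrast,
    "sharpness": ImageEnhance.Sharpness,
}

def transformData(img, metric, level):
    """Return a copy of `img` with one metric adjusted to `level`."""
    return ENHANCERS[metric](img).enhance(level)

# The levels described in the report: 0.5-1.5 in steps of 0.1 for
# brightness and contrast, 0.0-2.0 in steps of 0.2 for sharpness.
brightness_levels = [round(0.5 + 0.1 * i, 1) for i in range(11)]
sharpness_levels = [round(0.2 * i, 1) for i in range(11)]
```

Looping this helper over every imagename-orig.jpg in the source directory and saving each adjusted copy to the destination directory would reproduce the step described above.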
The code below will create two directories to store images.
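A sketch of that setup step; the folder names here are assumptions:

```python
import os

# One folder for the original photos, one for the adjusted versions.
SRC_DIR = "source_images"       # original imagename-orig.jpg photos
DST_DIR = "transformed_images"  # adjusted copies

for d in (SRC_DIR, DST_DIR):
    os.makedirs(d, exist_ok=True)  # no error if the folder already exists
```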
Then, for each of the three criteria, we divided the 10 versions of each image into the 5 versions on the lower end of the range (groups $X_{brightness}, X_{contrast}, X_{sharpness}$) and the 5 versions on the upper end (groups $Y_{brightness}, Y_{contrast}, Y_{sharpness}$). For instance, $X_{brightness}$ contains the images edited to the 5 lower brightness values, and $Y_{brightness}$ contains the images edited to the 5 higher brightness values.
Finally, the original and transformed photos were uploaded into the S3 bucket created for this project. An example adjustment in brightness for a photo is shown below.
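The upload step can be sketched with boto3; the bucket name and the key naming scheme below are assumptions, not the project's actual values:

```python
import os

BUCKET = "qtm350-image-project"  # assumed bucket name

def s3_key(image_name, metric, level):
    """Key under which an adjusted version is stored, e.g. 'brightness/bus-0.7.jpg'."""
    return f"{metric}/{image_name}-{level}.jpg"

def upload_folder(folder, bucket=BUCKET):
    """Upload every .jpg in `folder` to the S3 bucket."""
    import boto3  # imported lazily; calling this requires AWS credentials
    s3 = boto3.client("s3")
    for name in sorted(os.listdir(folder)):
        if name.endswith(".jpg"):
            s3.upload_file(os.path.join(folder, name), bucket, name)
```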
We hypothesize that AWS Rekognition will not be effective at detecting objects or people when brightness, contrast, and sharpness are adjusted. Our rationale is that Rekognition's algorithm is unlikely to have been trained on images with the varying levels of the metrics we adjusted. We also expect contrast to have the most significant effect on the confidence level: metrics such as contrast disproportionately influence the relative RGB values of an image, so we hypothesize that changing contrast will influence the model's output more than the other adjustments will.
Among the services offered by AWS Rekognition, we used 'Object and Scene Detection'. This service takes an image as input and returns the names of the objects or people it identifies, along with a value indicating the confidence of each identification. Below is an architecture diagram explaining the process.
We ran AWS Rekognition's "detect label" function on all the images in groups X and Y, then calculated the average confidence level of the labels: for each image, we averaged the confidence of a given label across its group-X versions, and likewise across its group-Y versions. For example, for an image of food, "detect label" applied the label "food" with different confidences across the versions, and for images of crosswalks it applied the label "road" with different confidences. Averaging those confidences lets us compare group X's averages against group Y's for each image.
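The underlying call is Rekognition's DetectLabels API, which can be sketched with boto3; the parsing helper below is an illustrative addition, not project code:

```python
def detect_labels_s3(bucket, key, max_labels=10):
    """Run Rekognition's DetectLabels on an image stored in S3."""
    import boto3  # imported lazily; calling this requires AWS credentials
    client = boto3.client("rekognition")
    return client.detect_labels(
        Image={"S3Object": {"Bucket": bucket, "Name": key}},
        MaxLabels=max_labels,
    )

def label_confidences(response):
    """Map each detected label name to its confidence score."""
    return {label["Name"]: label["Confidence"] for label in response["Labels"]}
```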
For the case above, group X's average confidence level is 0.94 and group Y's is 0.97. We repeat these steps across all versions of all the images we collected, then run hypothesis tests to check whether the means of groups X and Y differ across the three image criteria (brightness, contrast, sharpness).
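The averaging step can be sketched with made-up confidences, chosen here so that the averages reproduce the 0.94 and 0.97 figures quoted above:

```python
# Illustrative confidences for one image's five group-X (lower-brightness)
# and five group-Y (higher-brightness) versions; the real values would come
# from Rekognition's output.
group_x = [0.92, 0.93, 0.94, 0.95, 0.96]
group_y = [0.96, 0.97, 0.97, 0.97, 0.98]

avg_x = sum(group_x) / len(group_x)  # 0.94
avg_y = sum(group_y) / len(group_y)  # 0.97
```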
Finally, we use hypothesis testing, specifically the t-test, to test for equality of the sample means between groups.
For example, one hypothesis we test is whether the confidence level varies with brightness. For the null hypothesis, we suppose that, holding other things fixed, the confidence level does not depend on brightness. The test can be set up as follows:
$$ H_0: \mu_{X_{brightness}} = \mu_{Y_{brightness}} \\ H_1: \mu_{X_{brightness}} \neq \mu_{Y_{brightness}} $$

There are a total of 50 photos with the Road label and 80 photos with the Food label. For each of the three features ($brightness$, $contrast$, and $sharpness$), we divided all the photos into two groups.
We then conducted the two-sample Student's t-test to check for equality of the two samples' means at the $5\%$ significance level, using the function ttest_ind from the scipy.stats package. We fail to reject the null hypothesis ($H_0$: the confidence level does not depend on the feature) if the p-value is larger than 0.05. In that case, we find no evidence that the feature affects the confidence level, which suggests that AWS Rekognition works well regardless of external factors such as lighting or weather.
Following is the code for the hypothesis testing. We first imported the necessary libraries and packages and loaded the data set.
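A minimal sketch of one such test, using scipy.stats.ttest_ind as described; the confidences are made-up stand-ins for the values loaded from the data set:

```python
from scipy.stats import ttest_ind

# Illustrative confidences for group X (lower half of the brightness range)
# and group Y (upper half); real values would come from the data set.
group_x = [0.92, 0.95, 0.93, 0.96, 0.94]
group_y = [0.95, 0.96, 0.94, 0.97, 0.95]

t_stat, p_value = ttest_ind(group_x, group_y)
reject_h0 = p_value < 0.05  # True -> the group means differ at the 5% level
```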
Whether visually impaired or not, nearly everyone walks past at least one sidewalk during their day, so it is essential that AWS Rekognition identify the 'Road' label correctly.
From the result above, we get a test statistic of $t \approx 1.429$ with a corresponding p-value of $0.160$, which is larger than the significance level of $0.05$. Thus, we fail to reject the null hypothesis: we find no evidence that brightness affects the confidence level, suggesting that AWS Rekognition can detect the Road label well despite lighting conditions.
Here, we get a test statistic of $t \approx -0.715$ with a p-value of $0.478$, which is larger than the significance level of $0.05$. Thus, we fail to reject the null hypothesis: we find no evidence that contrast affects the confidence level, suggesting that AWS Rekognition can detect the Road label well despite changes in contrast.
Similarly, for sharpness, we get a test statistic of $t \approx 0.181$ with a corresponding p-value of $0.857$, which is larger than the significance level of $0.05$. Thus, we fail to reject the null hypothesis: we find no evidence that sharpness affects the confidence level, suggesting that AWS Rekognition can detect the Road label well despite changes in sharpness.
Eating is an essential part of everyone's daily life. However, vision-impaired people often have great difficulty with tasks such as locating their plates and cutlery. Hence, detecting where the food is plays an important role in aiding visually impaired users during meals. In this project, we focus only on the Food label, as it is the most common label among all the images.
From the result above, we get a test statistic of $t \approx -1.300$ with a corresponding p-value of $0.197$, which is larger than the significance level of $0.05$. Thus, we fail to reject the null hypothesis: we find no evidence that brightness affects the confidence level, suggesting that AWS Rekognition can detect the Food label well despite lighting conditions.
Here, we get a test statistic of $t \approx -0.518$ with a p-value of $0.606$, which is larger than the significance level of $0.05$. Thus, we fail to reject the null hypothesis: we find no evidence that contrast affects the confidence level, suggesting that AWS Rekognition can detect the Food label well despite changes in contrast.
Similarly, for sharpness, we get a test statistic of $t \approx -0.774$ with a corresponding p-value of $0.441$, which is larger than the significance level of $0.05$. Thus, we fail to reject the null hypothesis: we find no evidence that sharpness affects the confidence level, suggesting that AWS Rekognition can detect the Food label well despite changes in sharpness.
We produced a scatterplot of each of the three metrics against confidence level. For all three metrics, the results did not exhibit any trend.
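Such scatterplots could be produced with matplotlib along the following lines; the confidences here are made up for demonstration (and sharpness would use its own 0.0-2.0 level range):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Illustrative levels and confidences; real values come from Rekognition.
levels = [round(0.5 + 0.1 * i, 1) for i in range(11)]
demo_confidence = [0.95, 0.96, 0.94, 0.97, 0.95, 0.96,
                   0.95, 0.97, 0.96, 0.95, 0.96]

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, metric in zip(axes, ["brightness", "contrast", "sharpness"]):
    ax.scatter(levels, demo_confidence)
    ax.set_xlabel(metric)
    ax.set_ylabel("confidence")
fig.tight_layout()
fig.savefig("scatterplots.png")
```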
To summarize, here is the table of all the p-values from our hypothesis tests above.
\begin{array}{|c|c|c|c|c|} \hline & brightness & contrast & sharpness & \text{Reject } H_0 \\ \hline \text{Road} & 0.160 & 0.478 & 0.857 & \text{no} \\ \hline \text{Food} & 0.197 & 0.606 & 0.441 & \text{no} \\ \hline \end{array}
As shown in the table above, none of the p-values were statistically significant, so we did not have enough evidence to reject the null hypotheses. The tests therefore met our initial expectation that AWS Rekognition detects scenes and objects well despite adjustments mimicking environmental factors such as lighting and weather. However, the results did not support our hypothesis that contrast would have the strongest effect on the confidence level: although no p-value was significant, brightness had the smallest p-value of the three metrics, suggesting it has the strongest (if still insignificant) effect.
This finding motivates us to develop our model further by training it on images of a wider range of daily activities to increase its applicability. In the future, we hope to use Text-to-Speech (TTS), such as AWS Polly, to communicate the information obtained from image recognition to users with vision impairment. For instance, the application would detect an object the user is approaching, accurately identify its labels, and vocalize how far away it is, so the user can mitigate the risk of collisions or falls.